comprehensive evaluation
- Asia > China > Zhejiang Province > Ningbo (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
Toward a Stable, Fair, and Comprehensive Evaluation of Object Hallucination in Large Vision-Language Models
Given different instructions, large vision-language models (LVLMs) exhibit different degrees of object hallucinations, posing a significant challenge to the evaluation of object hallucinations. Overcoming this challenge, existing object hallucination evaluation methods average the results obtained from a set of instructions. However, these methods fail to provide consistent evaluation across instruction sets that generate image descriptions of significantly different lengths. In this paper, we present the first systematic investigation of the effect of instructions on object hallucinations in LVLMs, with a specific focus on the role played by image description lengths. A valuable finding is that instructions indirectly affect hallucinations through the length of image descriptions.
An Electrocardiogram Multi-task Benchmark with Comprehensive Evaluations and Insightful Findings
Xu, Yuhao, Lu, Jiaying, Ding, Sirui, Cao, Defu, Hu, Xiao, Yang, Carl
In the process of patient diagnosis, non-invasive measurements are widely used due to their low risks and quick results. Electrocardiogram (ECG), as a non-invasive method to collect heart activities, is used to diagnose cardiac conditions. Analyzing the ECG typically requires domain expertise, which is a roadblock to applying artificial intelligence (AI) for healthcare. Through advances in self-supervised learning and foundation models, AI systems can now acquire and leverage domain knowledge without relying solely on human expertise. However, there is a lack of comprehensive analyses over the foundation models' performance on ECG. This study aims to answer the research question: "Are Foundation Models Useful for ECG Analysis?" To address it, we evaluate language/general time-series/ECG foundation models in comparison with time-series deep learning models. The experimental results show that general time-series/ECG foundation models achieve a top performance rate of 80%, indicating their effectiveness in ECG analysis. In-depth analyses and insights are provided along with comprehensive experimental results. This study highlights the limitations and potential of foundation models in advancing physiological waveform analysis. The data and code for this benchmark are publicly available at https://github.com/yuhaoxu99/ECGMultitasks-Benchmark.
SurveyEval: Towards Comprehensive Evaluation of LLM-Generated Academic Surveys
Zhao, Jiahao, Zhang, Shuaixing, Xu, Nan, Wang, Lei
LLM-based automatic survey systems are transforming how users acquire information from the web by integrating retrieval, organization, and content synthesis into end-to-end generation pipelines. While recent works focus on developing new generation pipelines, how to evaluate such complex systems remains a significant challenge. To this end, we introduce SurveyEval, a comprehensive benchmark that evaluates automatically generated surveys across three dimensions: overall quality, outline coherence, and reference accuracy. We extend the evaluation across 7 subjects and augment the LLM-as-a-Judge framework with human references to strengthen evaluation-human alignment. Evaluation results show that while general long-text or paper-writing systems tend to produce lower-quality surveys, specialized survey-generation systems are able to deliver substantially higher-quality results. We envision SurveyEval as a scalable testbed to understand and improve automatic survey systems across diverse subjects and evaluation criteria.
- Asia > China (0.17)
- North America > United States (0.14)
Fine-Grained GRPO for Precise Preference Alignment in Flow Models
Zhou, Yujie, Ling, Pengyang, Bu, Jiazi, Wang, Yibin, Zang, Yuhang, Wang, Jiaqi, Niu, Li, Zhai, Guangtao
The incorporation of online reinforcement learning (RL) into diffusion and flow-based generative models has recently gained attention as a powerful paradigm for aligning model behavior with human preferences. By leveraging stochastic sampling via Stochastic Differential Equations (SDEs) during the denoising phase, these models can explore a variety of denoising trajectories, enhancing the exploratory capacity of RL. However, despite their ability to discover potentially high-reward samples, current approaches often struggle to effectively align with preferences due to the sparsity and narrowness of reward feedback. To overcome this limitation, we introduce a novel framework called Granular-GRPO (G$^2$RPO), which enables fine-grained and comprehensive evaluation of sampling directions in the RL training of flow models. Specifically, we propose a Singular Stochastic Sampling mechanism that supports step-wise stochastic exploration while ensuring strong correlation between injected noise and reward signals, enabling more accurate credit assignment to each SDE perturbation. Additionally, to mitigate the bias introduced by fixed-granularity denoising, we design a Multi-Granularity Advantage Integration module that aggregates advantages computed across multiple diffusion scales, resulting in a more robust and holistic assessment of sampling trajectories. Extensive experiments on various reward models, including both in-domain and out-of-domain settings, demonstrate that our G$^2$RPO outperforms existing flow-based GRPO baselines, highlighting its effectiveness and generalization capability.
- Research Report (0.82)
- Workflow (0.68)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Comprehensive Evaluation of Prototype Neural Networks
Schlinge, Philipp, Meinert, Steffen, Atzmueller, Martin
Prototype models are an important method for explainable artificial intelligence (XAI) and interpretable machine learning. In this paper, we perform an in-depth analysis of a set of prominent prototype models including ProtoPNet, ProtoPool and PIPNet. For their assessment, we apply a comprehensive set of metrics. In addition to applying standard metrics from literature, we propose several new metrics to further complement the analysis of model interpretability. In our experimentation, we apply the set of prototype models on a diverse set of datasets including fine-grained classification, Non-IID settings and multi-label classification to further contrast the performance. Furthermore, we also provide our code as an open-source library (https://github.com/uos-sis/quanproto), which facilitates simple application of the metrics itself, as well as extensibility -- providing the option for easily adding new metrics and models.
- Asia > China > Zhejiang Province > Ningbo (0.04)
- Asia > China > Zhejiang Province > Hangzhou (0.04)
Evaluating the Effectiveness of Cost-Efficient Large Language Models in Benchmark Biomedical Tasks
Jahan, Israt, Laskar, Md Tahmid Rahman, Peng, Chun, Huang, Jimmy
This paper presents a comprehensive evaluation of cost-efficient Large Language Models (LLMs) for diverse biomedical tasks spanning both text and image modalities. We evaluated a range of closed-source and open-source LLMs on tasks such as biomedical text classification and generation, question answering, and multimodal image processing. Our experimental findings indicate that there is no single LLM that can consistently outperform others across all tasks. Instead, different LLMs excel in different tasks. While some closed-source LLMs demonstrate strong performance on specific tasks, their open-source counterparts achieve comparable results (sometimes even better), with additional benefits like faster inference and enhanced privacy. Our experimental results offer valuable insights for selecting models that are optimally suited for specific biomedical applications.
Toward a Stable, Fair, and Comprehensive Evaluation of Object Hallucination in Large Vision-Language Models
Given different instructions, large vision-language models (LVLMs) exhibit different degrees of object hallucinations, posing a significant challenge to the evaluation of object hallucinations. Overcoming this challenge, existing object hallucination evaluation methods average the results obtained from a set of instructions. However, these methods fail to provide consistent evaluation across instruction sets that generate image descriptions of significantly different lengths. In this paper, we present the first systematic investigation of the effect of instructions on object hallucinations in LVLMs, with a specific focus on the role played by image description lengths. A valuable finding is that instructions indirectly affect hallucinations through the length of image descriptions.
Realistic Evaluation of TabPFN v2 in Open Environments
Cheng, Zi-Jian, Jia, Zi-Yi, Zhou, Zhi, Li, Yu-Feng, Guo, Lan-Zhe
Tabular data, owing to its ubiquitous presence in real-world domains, has garnered significant attention in machine learning research. While tree-based models have long dominated tabular machine learning tasks, the recently proposed deep learning model TabPFN v2 has emerged, demonstrating unparalleled performance and scalability potential. Although extensive research has been conducted on TabPFN v2 to further improve performance, the majority of this research remains confined to closed environments, neglecting the challenges that frequently arise in open environments. This raises the question: Can TabPFN v2 maintain good performance in open environments? To this end, we conduct the first comprehensive evaluation of TabPFN v2's adaptability in open environments. We construct a unified evaluation framework covering various real-world challenges and assess the robustness of TabPFN v2 under open environments scenarios using this framework. Empirical results demonstrate that TabPFN v2 shows significant limitations in open environments but is suitable for small-scale, covariate-shifted, and class-balanced tasks. Tree-based models remain the optimal choice for general tabular tasks in open environments. To facilitate future research on open environments challenges, we advocate for open environments tabular benchmarks, multi-metric evaluation, and universal modules to strengthen model robustness. We publicly release our evaluation framework at https://anonymous.4open.science/r/tabpfn-ood-4E65.
- Oceania > Australia > New South Wales (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- North America > United States > Nebraska (0.04)
- (9 more...)
- Information Technology (0.92)
- Energy (0.67)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.46)